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Abstract. We present an efficient algorithm for finding all approximate 
occurrences of a given pattern p of length m in a text t of length n allow- 
ing for translocations of equal length adjacent factors and inversions of 
factors. The algorithm is based on an efficient filtering method and has an 
0(nmmax(a,/3))-time complexity in the worst case and C(max(a, /3))- 
space complexity, where a and f3 are respectively the maximum length 
of the factors involved in any translocation and inversion. Moreover we 
show that under the assumptions of equiprobability and independence of 
characters our algorithm has a 0(n) average time complexity, whenever 
a — i7(logm/loglog 1_e m), where e > and a is the dimension of the 
alphabet. Experiments show that the new proposed algorithm achieves 
very good results in practical cases. 

1 Introduction 

Retrieving information and teasing out the meaning of biological sequences are 
central problems in modern biology. Generally, basic biological information is 
stored in strings of nucleic acids (DNA, RNA) or amino acids (proteins). Align- 
ing sequences helps in revealing their shared characteristics, while matching se- 
quences can infer useful information from them. With the availability of large 
amounts of DNA data, matching of nucleotide sequences has become an impor- 
tant application and there is an increasing demand for fast computer methods 
for analysis and data retrieval. 

Approximate string matching is a fundamental problem in text processing 
and consists in finding approximate matches of a pattern in a string. The close- 
ness of a match is measured in terms of the sum of the costs of the edit operations 
necessary to convert the string into an exact match. Most classical models, e.g., 
Levenshtein or Damerau edit distance (for a survey see [5]) assume that changes 
between strings occur locally. However, evidence shows that large scale changes 
are possible in chromosomal rearrangment. For example, large pieces of DNA in 
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Fig. 1. An example of chromosomal inversion. 

a chromosomal sequence can be broken and moved from one location to another. 
This is known as a chromosomal translocation. Sometimes a mutation can also 
flip a stretch of DNA within a chromosome, producing a chromosomal inversion. 

In particular a chromosomal inversion is a rearrangement in which a segment 
of a chromosome is reversed end to end. An inversion occurs when a single chro- 
mosome undergoes breakage and rearrangement within itself. Fig. 1(A) shows 
an example of chromosomal inversion. 

Differently a chromosomal translocation is a chromosome abnormality caused 
by rearrangement of parts of the same chromosome or between nonhomologous 
chromosomes. Sometimes a chromosomal translocation could join two separated 
genes, the occurrence of which is common in cancer. Fig. 1(B) shows an example 
of chromosomal translocation. 

Both inversions and translocations do not involve a loss of genetic informa- 
tion, but simply rearrange the linear gene sequence. 

Recently Cantone et al. [2] presented the first solution for the matching prob- 
lem under a string distance whose edit operations are translocations of equal 
length adjacent factors and inversions of factors. In particular, they devised 
a 0{nmmax(a, /3))-time and 0(m 2 )-space algorithm, where a and j5 are the 
maximum length of the factors involved in a translocation and in an inver- 
sion, respectively. They showed that under the assumption of equiprobability 
and independence of characters in the alphabet, on average the algorithm has a 
0{n\og a m)-time complexity. Moreover they also presented a bit-parallel imple- 
mentation of their algorithm, which has C(nmax(a, /3))-time and 0(cr+TO)-space 
complexity, if the pattern length is comparable with the computer word size. 

In this paper we present a new algorithm for the same problem based on 
an efficient permutation filtering method and on a dynamic programming ap- 
proach for testing candidate positions. In particular our algorithm achieves a 
0(nmmax(a, /3))-worst case time complexity, as the M-Sampling algorithm, 
and requires only C(max(a, /?)) space. Moreover we show that under the as- 
sumption of equiprobability and independence of characters in the alphabet, on 
the average our filter based algorithm achieves a 0(n)-time complexity, when 
a = J?(log m/ log log 1_£ m), where e > and a is the dimension of the alphabet. 

A slightly shorter version of this manuscript was submitted to Information 
Processing Letters. 



2 Basic notions and definitions 



Let p be a string of length to > 0, over an integer alphabet E of size er. We 
represent it as a finite array p[0 . . . m—1] of characters from S and write len{p) = 
m. In particular, for m = we obtain the empty string e. We denote by p[i] the 
(i + l)th character of p, for < i < to. Likewise, the substring (also called factor) 
of p contained between the (i + l)th and the (j + l)th characters of p is indicated 
with p[i . . . j], for < i < j < to. An m-substring (or m-factor) is a substring 
of length to. We also put pi = Def p[0 . . . i], for < i < to. In addition, we write 
pp' to denote the concatenation of p and p', and p r for the reverse of the string 
p, i.e., p r = Dof p[m — l]p[m — 2] . . .p[0]. Given a string p and a character c£ S, 
we define occ p (c) as the number of times the character c occurs in p (observe 
that < occp(c) < len(p)). 

A distance d : 17* x Z 1 * — > R is a function which associates to any pair of 
strings X and Y the minimal cost of any finite sequence of edit operations which 
transforms X into Y, if such a sequence exists, oo otherwise. 

Definition 1. Given two strings X and Y , the mutation distance md(X 1 Y) is 
based on the following edit operations: 

(1) Translocation: a factor of the form ZW is transformed into WZ, provided 
that len(Z) = len(W) > 0. 

(2) Inversion: a factor Z is tranformed into Z' . 

Both operations are assigned unit cost. □ 

We indicate with a and [3 the maximum length of factors involved translo- 
cations and inversions, respectively. By definition, a < [len(X)/2\ and f3 < 
len(X). When md(X, Y) < oo, we say that X and Y have an md-match. Addi- 
tionally, if X has an mcZ-match with a suffix of Y, we write X ^ m d Y. 

3 Proposed Algorithm 

In this section we present a new efficient algorithm for the approximate string 
matching problem allowing for inversions of factors and translocations of equal 
length adjacent factors. In the following we assume that p and t are strings of 
length to and n respectively, over a common alphabet S = {c , . . . , c a -{\, where 
a = 0(n). (The case of even larger alphabets is rather theoretical and can be 
handled with standard solutions, e.g., using a minimal perfect hash function.) 

The new algorithm, named GFG algorithm, searches for all occurrences of 
p in t by making use of an efficient filter method. This technique, usually called 
as the counting filter, is known in the literature [3, 4, 1] and has been used for 
fc-mismatches and fc-diffcrcnccs. The idea behind the filter is straightforward 
and is based upon the observation that (in our problem) if the pattern p has 
an approximate occurrence (possibly involving inversions and translocations) 
starting at position s of the text then the m-substring of the text t[s . . . s+m— 1] 
is a permutation of the pattern. 



Then the GFG algorithm identifies the set r Ptt of all candidate positions 
s in the text such that the substring t[s . . . s + to — 1] is a permutation of the 
characters in p and, for each s <G -T Pjt , executes a verification procedure in order 
to check the approximate occurrence. 

Before entering into details we need to introduce some additional notations. 
Given two strings w and z, we define a distance function S(w, z) as 

8{w,z) = ^ abs(occ w (c) ~occ z (c)). 

Obviously, if len(w) = len(z), then S(w, z) = iff w is a permutation of z. 

For each position s in the text, with < s < n — to, we define a function 
G s : £ -)• N, as 

G s (c) = occ p (c) - occ t(s ^ m) (c) 

for c S, and where we set t(s, m) = t [s . . . s + m — 1] . 

Finally we define, for each position s, the distance value 5 S as follows 

S s = 5(p,t s ) = ^2 abs(ocCp(c) - occ t(s ^ m) (c)) = ^ abs(G s (c)). 

c££ c££ 

Then the set r p ^ of all candidate positions in the text can be defined as 

r p ,t — {s\0<s<n — m and S s = 0}. 

Observe that values 5 s +i and S s can differ only in the number of occur- 
rences of characters t[s] and t[s + m]. Specifically we have occ t ( s+1 , m )(t[s\) > 
occt( s ,m){t[s])-l and occ t ( s+ i tm ){i[ s + m \) < occ t ( StTn )(t[s + m}) + 1. Moreover, if 
t[s] = t[s + m], the two functions occ t ( s +i, m ) ancl occ t ( s m ) do not differ for any 
value. 

Therefore, for each character c e S, the value of G s+ i(c) can be computed 
in constant time from G s (c) by using the following relation 

( G s (c) - 1 if c = t[s + m] j=t[s] 
G s+ i(c) = Ig s (c) + 1 if c - t[s] ^ t[s + m] 
[ G s (c) otherwise 

which gives the following relation for computing 5 s+ i from S s in constant time 

S s+1 =S S - abs(G s (t[s})) - abs(G s (t[s + m])) + 
+ abs(G s+1 (t[s})) + abs(G s+1 (t[s + m})). 

Fig. 2 shows the pseudocode of the GFG algorithm (on the left) and the 
verification procedure (on the right). Note that the main loop of GFG has only 
one conditional and the integer abs function is translated by modern compilers 
(including GNU C Compiler) into branchless code. 

The verification procedure is based on dynamic programming. The algorithm 
uses two matrices, F and /, both of size to 2 , in order to compute occurrences of 





Verify (p, m, t, s, a, f3) 


1. for c G 23 do G[c] <— 


l 




2. for s ^ — to 'fn — 1 do 


2 


f o r % ^ to in X d o 




3. 


for j max(0, i — 7) to min(m — 1, i + 7) do 




4. 


F[i, j] <- /[i, m - j - 1] <- 


5. 5 <- 


5. 


if (p[i] = tfs+j]) then 


6. for c £ £ do 5 <— <5 + abs(G[c]) 


6. 


if (j = or j = 0) then F[i,j] <- 1 


7. for s{-flton-m do 


7. 


else F[i,j] <- F[i - l,j - 1] + 1 


8. if 6 = then 


8. 


if (p[i] — t[s + m — j — 1]) then 


9. Verify(p, m, t, s, a, f3) 


9. 


if (i — or j — 0) then /[i, m — j — 1] <— 1 


10. a <- t[s] 


10. 


else /[z, m — j — 1] <— /[« — 1, m — j] + 1 


11. b i- t[s + m] 


11. 


if (p[i] = t[ s + j] and (i = or S[i - 1] = 1)) 


12. 6 <- 6 - abs(G[a]) - abs(G[b]) 


12. 


then i- 1 else <- 


13. G[a] <- G[a] + 1 


13. 


for fc <— 1 to min(ct, L^^J) do 


14. G[b] <- G[fc] - 1 


14. 


if (F[i, i - fc] > fc and F[i - fc, i] > k) then 


15. 5 <- 5+ abs(G[a]) + abs(G[b}) 


15. 


if (i < 2k or S[i - 2k] = 1) then S[i] <- 1 


16. if S = then 


16. 


for fe <- 2 to min(/3, i + 1) do 


17. VERWY(p,m,t,n ~ m,a, /3) 


17. 


if i - k + 1] > fe) then 




18. 


if (i < fe or S[i - fe] = 1) then S[i] <- 1 




19. 


if (5[m - 1] = 1) then Output(s) 



Fig. 2. (on the left) The GFG algorithm for the approximate string matching problem 
with inversions and translocations and (on the right) the verification procedure. 



factors and inverted factors of p, respectively, in the substring t[s . . . s + to — 1]. 
More formally we define 

F[i,j] — max{fc | p[i — k + 1 . . . i] = t[s + j — k + 1 . . . s + and 
I[i,j] = maxjfc | p[i - k + 1 . . . i] = (t[s +j...s + j + k- l]) r } 

for < i < m and max(0, i — 7) < j < min(m — 1,2 + 7), where 7 = min(a, (5). 
Moreover a vector £, of size to, is maintained in order to compute the md- 
matches of all prefixes of the pattern in t[s . . . s + m — 1]. More formally, for 
< i < to, we have S[i] = 1 if pi ^ m d t[s . . .s + i] and S[i] = otherwise. 
The following recursive relations are used for computing F and /. 

if p\i] ^ t[s + j] 
F[i,j] = { F[i - l,j - 1] + 1 if i > 0,j > max(0,i - a) and p[i] = t[s + j] 

1 otherwise 

x p[t\ t[s + j] 

I[i,j] = { I[i - l,j + 1] + 1 if i > 0,j < min(TO - l,i + /3) and p[i] = t[s + j] 

1 otherwise 

Finally the vector S is computed, for increasing i — 1 . . . to — 1 [S[i] is set to 
0) according to the following (recursive) formula. The value of S[i] is set to 1 iff 
one of the following conditions holds: 

- p[i] = t[s + i] and (i = or S[i - 1] = 1); 

- F[i,i- k] > k, F[i - k,i] > k and (i < 2k or S[i - 2k] = 1), for 1 < k < 
min( a ,L^J); 

- I[i,i-k + i\ > k and (i < k or S[i - k] = 1), for 1 < k < min(/3,i + 1). 



Then p has an m<i-match starting at position s of the text if S[m — 1] = 1 at 
the end of the verification procedure with parameter p, t and s. 

Observe that for computing the entry of position i in S only the last (3 
entries of the (i — l)th row of I are needed, while only the last a entries of the 
(i — l)th row of F and of the (i — l)th column of F are needed. Similarly only 
the last max(2a, (3) entries of the vector S are needed for computing the value 
S[i]. Moreover, both for I and F, the computation of the ith row (column) needs 
only the values in the (i — l)th row (column) of the matrix. 

It is thus straightforward to reduce the space requirements of the verification 
phase to C(max(a, /?)). This is done by maintaining, for each iteration, only two 
rows of / and only two rows and two columns of F, each of size max(ev, /3). 

The verification time and space costs are thus 0(m max(a, /?)) and 0(max(a, /?)), 
respectively, leading to overall C(nmmax(a, /3)) worst case time complexity and 
£>(max(a, f3, a)) space complexity for the GFG algorithm. 



4 Average Case Time Analysis 

Next, we evaluate the average time complexity of the GFG algorithm. In our 
analysis we assume the uniform distribution and independence of characters. 
We first assume that m = cj(a°^), Then we prove the more simple case when 
m < g. 

Our verification procedure takes 0(m 2 ) (worst-case) time per location. To 
obtain linear average time, we must thus bound the probability of having per- 
muted subsequences of length m with C(l/m 2 ). We will find conditions upon 
which this happens. 1 

Suppose m = u}(a°^), we define k — m/cr and, without loss of generality, 
we assume that a divides m. For each text position s, with < s < n — m, 
the probability that the m-substring of the text, beginning at position s, is a 
permutation of the pattern p is exactly 

/ m \ /m—occ(co)\ (m—occ(cQ) — occ(c-i)\ /occ(c CT _i)\ 

p r p 1 _ \OCC(C )J V OCc(ci) )\ 0CC(C 2 ) I ' ' ' \OCc(c„_l)/ / -I \ 

Now, it is easy to notice that the probability given in (1) is maximized when 
occ(ci) = k for all i. We can thus write: 

lm\ lm-k\ (m-2k\ lk\ i 

VrU^r i < UK k A k LiiUj _ to! 

We make use of Stirling's approximation for both m! and k\, recalling that 
k = m/a: 



i\ _ ( v / 2^(m/e) m \ _ | 



1 The paper [1] contains an analysis of the counting filter, in the fc-differences problem. 
Unfortunately, the analysis seems to be flawed, which was admitted in discussion by 
the second author of the cited paper (G. Navarro). 



Let us upper-bound v27r/(v27r)°' with 1 and remove it. We have: 

o°' 2 



e | — — | =e 



(<7-l)/2 



Hi 



Let us assume to > cr 4 (we recall that m = ^(er ^)). Then «W 2 / to(<7_1)/2 
is less than or equal to l/cr 1 - 5 ' 7-2 . Note that if we take a larger lower bound on 
to, e.g., cr 8 , then our upper bound gets even smaller, namely l/cr 3 - 5<T ~ 4 in this 
example. All in all, we have 

Pr{.s e r Pit } = 0(l/a ^) = 0{l/m 2 ) 

for any a = J7(logm/loglog 1_£ m), where e > 0. 

Suppose now that m < cr. 2 Then the probability that the m-substring of the 
text, beginning at position s, is a permutation of the pattern p is 

„ , „ , ml ml 
Pr{s G r Pit } < — < — 

If we make use again of Stirling's approximation for ml, we obtain 

Pr{s € r p t } < V2^^—- = V^k— = 0{l/m 2 ). 

Thus the overall average time complexity of the GFG algorithm, assuming 
a = ^(logTO/loglog 1 ^ 6 m), is given by the following relation: 



1 (n, m,a) 

= 0((7 + m) + (n - m + 1) • C(1/to 2 ) • 0(m 2 ) = 0(n). 



5 Experimental results 

In this section we evaluate the performance of the following algorithms: 

- The M- Sampling algorithm [2] (MS) 

- The GFG algorithm using the M-Sampling algorithm for verification (GFG1) 

- The GFG algorithm as shown in Fig.2 (GFG2) 

All algorithms have been implemented in C and compiled with the GNU C Compiler 
4.2, using the options -02 -f no-guess-branch-probability. All tests have 
been performed on a 2 GHz Intel Core 2 Duo and running times have been mea- 
sured with a hardware cycle counter, available on modern CPUs. We used the 
following input files: 

2 Note that for the more general case of m = there exists already an average- 

case linear algorithm [2], so this part of the analysis is only to find properties of the 
currently presented algorithm. 



(i) four random texts of 2, 000, 000 characters with a uniform distribution over 
alphabets of dimension cr, with a € {4, 8, 16, 32} respectively, 

(ii) a protein sequence of 2,900,352 characters from the Saccharomyces cere- 
visiae genome (with a = 20), 3 

(iii) a genome sequence of 4, 638, 690 base pairs of Escherichia coli (with a = 
4). 4 

For each input file, we have generated seven sets of 200 patterns of fixed length 
to randomly extracted from the text (with at least one occurrence in the text), 
for to ranging over the values 8, 16, 32, 64, 128, 256, 512. For each set of patterns 
we reported the mean time over 200 runs, expressed in milliseconds. 



Random text with a — 


4 


m 


MS 


GFG1 


GFG2 


8 


254.78 


48.53 


73.73 


16 


350.25 


50.05 


103.09 


32 


441.05 


44.20 


102.04 


64 


528.35 


43.83 


140.18 


128 


645.36 


43.20 


208.05 


256 


868.13 


41.84 


273.47 


512 


1273.13 


44.71 


349.57 




Random text with a — 


32 


m 


MS 


GFG1 


GFG2 


8 


93.80 


28.18 


28.52 


16 


110.64 


28.20 


28.53 


32 


128.80 


28.25 


28.55 


64 


169.25 


28.42 


28.61 


128 


197.24 


28.65 


28.93 


256 


259.77 


29.45 


30.23 


512 


398.20 


32.07 


35.11 



Random text with a — 8 


m 


MS 


GFG1 


GFG2 


8 


155.39 


29.57 


29.78 


16 


193.91 


29.21 


28.98 


32 


241.54 


29.20 


28.72 


64 


309.26 


29.33 


28.75 


128 


377.17 


29.68 


29.16 


256 


525.96 


30.75 


30.89 


512 


770.45 


34.14 


37.73 




Escherichia coli 


m 


MS 


GFG1 


GFG2 


8 


593.49 


117.79 


184.48 


16 


781.76 


108.53 


208.50 


32 


976.79 


99.88 


222.19 


64 


1188.58 


94.64 


267.01 


128 


1484.03 


84.16 


252.17 


256 


2005.00 


80.40 


257.70 


512 


2929.90 


83.36 


299.49 



Random text with a — 


16 


m 


MS 


GFG1 


GFG2 


8 


115.27 


28.45 


28.55 


16 


137.27 


28.48 


28.54 


32 


161.25 


28.51 


28.57 


64 


211.75 


28.65 


28.66 


128 


273.53 


28.94 


29.01 


256 


371.65 


29.86 


30.34 


512 


536.40 


32.85 


35.79 




Saccharomyces cerevisiae 


m 


MS 


GFG1 


GFG2 


8 


163.25 


41.38 


41.45 


16 


192.64 


41.39 


41.45 


32 


224.27 


41.44 


41.48 


64 


297.01 


41.56 


41.60 


128 


376.27 


41.88 


41.91 


256 


506.88 


42.79 


43.25 


512 


738.19 


45.72 


48.65 



The experimental results show that the filtering strategy is quite effective 
and allows to dramatically speed up, by a factor of at most 30, the computation 
of the TOci-matches of a given pattern. It is worth observing that for very small 
alphabets the GFG1 algorithm, based on M-Sampling, is faster than the GFG2 
algorithm, based on the dynamic programming verification, while in the other 
cases the two algorithms have almost the same speed. In the following tables 
we report the mean, over the 200 runs, of the number of pattern's permutations 
found per text position. 



Random text (cr 


= 4) 


m 


# candidate 


8 


0.013621 


16 


0.006399 


32 


0.001837 


64 


0.000720 


128 


0.000285 


256 


0.000093 


512 


0.000029 



Random text (cr 


= 8) 


m 


# candidate 


8 


0.000410 


16 


0.000037 


32 


0.000004 


64 


0.000001 


128 


0.000001 


256 


0.000001 


512 


0.000001 



Random text (a — 16) 



# candidate 



16 

.32 

64 

128 

256 

512 



0.000004 
0.000001 
0.000001 
0.000001 
0.000001 
0.000001 
0.000001 



Average number of candidate positions for each text character on random texts 
with cr = 4 (on the left) cr = 8 (in the center) a — 16 (on the right) 

Observe that, while for small alphabets the number is non negligible also for 
long patterns, for large enough alphabets it is always insignificant. 

3 http: //data- compression. inf o/Corpora/ProteinCorpus/ 

4 http : // corpus . canterbury . ac .nz/ 
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